Zipf's law, an empirical law formulated using mathematical statistics, refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions. The law is named after the American linguist George Kingsley Zipf (1902–1950), who popularized it and sought to explain it (Zipf 1935, 1949), though he did not claim to have originated it. The French stenographer Jean-Baptiste Estoup (1868–1950) appears to have noticed the regularity before Zipf.〔Christopher D. Manning, Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press (1999), ISBN 978-0-262-13360-9, p. 24〕 It was also noted in 1913 by the German physicist Felix Auerbach (1856–1933).〔Auerbach, F. (1913). Das Gesetz der Bevölkerungskonzentration. ''Petermanns Geographische Mitteilungen'' 59, 74–76〕

==Motivation==
Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on: the rank-frequency distribution is an inverse relation.

For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.〔P. 139: "For example, in the Brown Corpus, consisting of over one million words, half of the word volume consists of repeated uses of only 135 words."〕

The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel,〔M. Eriksson, S. M. Hasibur Rahman, F. Fraille, M. Sjöström, "Efficient Interactive Multicast over DVB-T2 – Utilizing Dynamic SFNs and PARPS", 2013 IEEE International Conference on Computer and Information Technology (BMSB '13), London, UK, June 2013. Suggests a heterogeneous Zipf-law TV channel-selection model.〕 and so on. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.〔Auerbach, F. (1913). Das Gesetz der Bevölkerungskonzentration. ''Petermanns Geographische Mitteilungen'' 59, 74–76〕

Empirically, a data set can be tested to see whether Zipf's law applies by checking the goodness of fit of an empirical distribution to the hypothesized power law distribution with a Kolmogorov–Smirnov test, and then comparing the (log) likelihood ratio of the power law distribution to alternative distributions such as an exponential or a lognormal distribution.〔Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-Law Distributions in Empirical Data. ''SIAM Review'', 51(4), 661–703. doi:10.1137/070710111〕 When Zipf's law is checked for cities, a better fit has been found with exponent ''b'' = 1.07; i.e., the ''n''-th largest settlement is 1/''n''<sup>1.07</sup> the size of the largest settlement.
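The goodness-of-fit procedure just described can be sketched in a few lines of Python. The snippet below is only an illustration of the general recipe from Clauset, Shalizi & Newman (2009), not their reference implementation: the synthetic sample, the choice of ''x''<sub>min</sub>, and the use of scipy's continuous Pareto distribution as a stand-in for the discrete power law are all simplifying assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic "frequency" data drawn from a discrete Zipf distribution,
# standing in for real word counts or city sizes (assumed, for illustration).
data = stats.zipf.rvs(a=2.0, size=5000, random_state=rng)

# 1. Maximum-likelihood estimate of the power-law exponent, using the
#    discrete approximation from Clauset et al. (2009), with x_min taken
#    as the smallest observation for simplicity.
x_min = data.min()
alpha_hat = 1.0 + len(data) / np.log(data / (x_min - 0.5)).sum()

# 2. Kolmogorov-Smirnov distance between the empirical distribution and the
#    fitted power law (scipy's continuous Pareto with shape alpha - 1).
ks_stat, ks_p = stats.kstest(data, "pareto", args=(alpha_hat - 1.0, 0, x_min))

# 3. Log-likelihood ratio against a lognormal alternative: a clearly
#    positive value favours the power law, a negative one the lognormal.
shape, loc, scale = stats.lognorm.fit(data, floc=0)
ll_power = stats.pareto.logpdf(data, alpha_hat - 1.0, scale=x_min).sum()
ll_lognorm = stats.lognorm.logpdf(data, shape, loc, scale).sum()

print(f"alpha = {alpha_hat:.2f}, KS = {ks_stat:.3f} (p = {ks_p:.3f})")
print(f"log-likelihood ratio (power law - lognormal) = {ll_power - ll_lognorm:.1f}")
```

On real data the KS ''p''-value should be obtained via the semi-parametric bootstrap of Clauset et al. rather than read off directly, since the exponent is estimated from the same sample being tested.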
While Zipf's law holds for the upper tail of the distribution, the entire distribution of cities is log-normal and follows Gibrat's law.〔Eeckhout, J. (2004). Gibrat's Law for (All) Cities. ''American Economic Review'' 94(5), 1429–1451.〕 Both laws are consistent because a log-normal tail typically cannot be distinguished from a Pareto (Zipf) tail.
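To see why the two tails are hard to tell apart in practice, consider a small numerical illustration (the parameters below are assumed for demonstration and are not taken from Eeckhout's paper): the upper tail of a lognormal sample traces a nearly straight line on a log-log rank-size plot, which is precisely the signature of a Pareto (Zipf) tail.

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 lognormal "city sizes" (the mean and sigma of log-size are assumptions).
sizes = np.sort(rng.lognormal(mean=9.0, sigma=2.5, size=10_000))[::-1]

# Fit a line to log(size) vs. log(rank) over the top 1% of the sample;
# a true Pareto/Zipf tail with exponent b would give a slope near -1/b.
top = sizes[:100]
ranks = np.arange(1, len(top) + 1)
slope, intercept = np.polyfit(np.log(ranks), np.log(top), 1)
print(f"log-log slope of the upper tail: {slope:.2f}")
```

Although the sample is lognormal throughout, the upper tail fits a straight line well, with a slope in the neighbourhood of what a Zipf-like Pareto tail would produce, which is why goodness-of-fit tests on the tail alone rarely separate the two hypotheses.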